{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\">\n",
    "## Открытый курс по машинному обучению\n",
    "<center>Автор материала: Плаксина Елена Константиновна, Levka."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <center>Обзор библиотеки для генерации временных признаков tsfresh</center>\n",
    "### <center>Time Series FeatuRe Extraction based on Scalable Hypothesis tests</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Библиотека используется для извлечения признаков из временных рядов. Практически все признаки, которые могут прийти вам в голову, уже внесены в расчёт этой библиотеки и нет никакого смысла создавать их самому, когда это можно сделать парой строчек кода из библиотеки.\n",
    "\n",
    "Извлечённые признаки могут быть использованы для описания или кластеризации временных рядов. Также их можно использовать для задач классификации/регрессии на временных рядах."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://tsfresh.readthedocs.io/en/latest/_images/introduction_ts_exa_features.png\" width=\"700\" height=\"600\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Процесс расчёта признаков состоит из двух этапов:\n",
    "- Расчёт всех возможных признаков\n",
    "\n",
    "```python\n",
    "from tsfresh import extract_features\n",
    "extracted_features = extract_features(timeseries, column_id=\"id\", column_sort=\"time\")\n",
    "```\n",
    "- Отбор релевантных признаков и удаление константных/нулевых признаков\n",
    "\n",
    "```python\n",
    "from tsfresh import select_features\n",
    "from tsfresh.utilities.dataframe_functions import impute\n",
    "\n",
    "impute(extracted_features)  # удаление константных признаков\n",
    "features_filtered = select_features(extracted_features, y)  # отбор признаков\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Процедура отбора признаков\n",
    "#### Стадия 1\n",
    "Расчёт признаков\n",
    "#### Стадия 2\n",
    "Проверка на значимость каждого признака, расчёт p-value\n",
    "#### Стадия 3\n",
    "Поправка на множественную проверку гипотез Бенджамини-Иекутиели"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:18:21.873848Z",
     "start_time": "2018-04-24T18:18:21.870343Z"
    }
   },
   "source": [
    "<img src=\"http://tsfresh.readthedocs.io/en/latest/_images/feature_extraction_process_20160815_mc_1.png\" width=\"700\" height=\"600\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Приведём пример генерации признаков на основе датасета Human Activity Recognition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:35:06.604259Z",
     "start_time": "2018-04-24T18:35:06.592741Z"
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pylab as plt\n",
    "\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from tsfresh import (extract_features, extract_relevant_features,\n",
    "                     select_features)\n",
    "from tsfresh.examples.har_dataset import (download_har_dataset,\n",
    "                                          load_har_classes, load_har_dataset)\n",
    "from tsfresh.feature_extraction import ComprehensiveFCParameters\n",
    "from tsfresh.utilities.dataframe_functions import impute"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Загрузка и отрисовка данных**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:32:59.112311Z",
     "start_time": "2018-04-24T18:32:40.608378Z"
    }
   },
   "outputs": [],
   "source": [
    "download_har_dataset()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:32:59.373712Z",
     "start_time": "2018-04-24T18:32:59.113813Z"
    }
   },
   "outputs": [],
   "source": [
    "df = load_har_dataset()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:35:03.282511Z",
     "start_time": "2018-04-24T18:35:03.191846Z"
    }
   },
   "outputs": [],
   "source": [
    "plt.title('accelerometer reading')\n",
    "plt.plot(df.iloc[0,:])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Извлечение признаков**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:37:44.151891Z",
     "start_time": "2018-04-24T18:37:44.148384Z"
    }
   },
   "outputs": [],
   "source": [
    "# расчёт только определённого набора параметров, заданного в ComprehensiveFCParameters\n",
    "extraction_settings = ComprehensiveFCParameters()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:40:36.723372Z",
     "start_time": "2018-04-24T18:40:36.712868Z"
    }
   },
   "outputs": [],
   "source": [
    "# переформируем данные 500 первых показаний сенсоров column-wise, как этого требует формат библиотеки\n",
    "N = 500\n",
    "master_df = pd.DataFrame({0: df[:N].values.flatten(),\n",
    "                          1: np.arange(N).repeat(df.shape[1])})\n",
    "master_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:41:33.544564Z",
     "start_time": "2018-04-24T18:40:57.405324Z"
    }
   },
   "outputs": [],
   "source": [
    "X = extract_features(master_df, column_id=1, impute_function=impute, default_fc_parameters=extraction_settings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:42:22.275690Z",
     "start_time": "2018-04-24T18:42:22.272685Z"
    }
   },
   "outputs": [],
   "source": [
    "\"Число рассчитанных признаков: {}.\".format(X.shape[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Обучение классификатора**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:45:36.391311Z",
     "start_time": "2018-04-24T18:45:36.384788Z"
    }
   },
   "outputs": [],
   "source": [
    "y = load_har_classes()[:N]\n",
    "y.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:45:43.110969Z",
     "start_time": "2018-04-24T18:45:43.100963Z"
    }
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:45:47.906376Z",
     "start_time": "2018-04-24T18:45:47.813708Z"
    }
   },
   "outputs": [],
   "source": [
    "cl = DecisionTreeClassifier()\n",
    "cl.fit(X_train, y_train)\n",
    "print(classification_report(y_test, cl.predict(X_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Отберём признаки для каждого класса отдельно и решим задачу бинарной классификации**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:49:57.768306Z",
     "start_time": "2018-04-24T18:49:43.341016Z"
    }
   },
   "outputs": [],
   "source": [
    "relevant_features = set()\n",
    "\n",
    "for label in y.unique():\n",
    "    y_train_binary = y_train == label\n",
    "    X_train_filtered = select_features(X_train, y_train_binary)\n",
    "    print(\"Number of relevant features for class {}: {}/{}\".format(label, X_train_filtered.shape[1], X_train.shape[1]))\n",
    "    relevant_features = relevant_features.union(set(X_train_filtered.columns))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:50:01.727717Z",
     "start_time": "2018-04-24T18:50:01.724211Z"
    }
   },
   "outputs": [],
   "source": [
    "len(relevant_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Мы уменьшили количество признаков с 794 до 264."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:50:10.141904Z",
     "start_time": "2018-04-24T18:50:10.136905Z"
    }
   },
   "outputs": [],
   "source": [
    "X_train_filtered = X_train[list(relevant_features)]\n",
    "X_test_filtered = X_test[list(relevant_features)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:50:16.432864Z",
     "start_time": "2018-04-24T18:50:16.429357Z"
    }
   },
   "outputs": [],
   "source": [
    "X_train_filtered.shape, X_test_filtered.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:50:24.694811Z",
     "start_time": "2018-04-24T18:50:24.649243Z"
    }
   },
   "outputs": [],
   "source": [
    "cl = DecisionTreeClassifier()\n",
    "cl.fit(X_train_filtered, y_train)\n",
    "print(classification_report(y_test, cl.predict(X_test_filtered)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Качество модели практически не изменилось, однако модель стала намного проще."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Сравнение с классификатором на стандартных признаках**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:53:10.252470Z",
     "start_time": "2018-04-24T18:53:10.248964Z"
    }
   },
   "outputs": [],
   "source": [
    "X_1 = df.iloc[:N,:]\n",
    "X_1.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:53:11.053575Z",
     "start_time": "2018-04-24T18:53:11.049569Z"
    }
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X_1, y, test_size=.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-24T18:53:17.756822Z",
     "start_time": "2018-04-24T18:53:17.720767Z"
    }
   },
   "outputs": [],
   "source": [
    "cl = DecisionTreeClassifier()\n",
    "cl.fit(X_train, y_train)\n",
    "print(classification_report(y_test, cl.predict(X_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Как видимо, качество модели значительно улучшилось по сравнению с наивным классификатором."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
